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1.  Introduction 


In  Baek  et  al  (1994)  (subsequently  abbreviated  BGWMF)  techniques  are  given  for 
a  hypothesis-testing  approach  to  discriminant  analysis  in  which  one  wishes  to  control  one 
of  the  probabilities  of  misclassification.  Methods  are  presented  for  continuous  variables 
only,  as  well  as  for  a  mixture  of  continuous  and  categorical  variables.  Essentially,  the 
hypothesis-testing  approach  based  on  the  ratio  of  maximized  likelihood  functions 
proposed  by  Krzanowski  (1980, 1982)  is  employed  and  the  test  statistic  is  bootstrapped  in 
order  to  estimate  critical  values  for  the  allocation  rule  in  such  a  way  that  the  error  rate  is 
controlled.  In  Miller  et  al  (1993)  (subsequently  abbreviated  (MGW)),  a  similar 
hypothesis-testing  approach  is  used  for  discriminant  analysis  and  outlier  detection  in  the 
presence  of  missing  data.  The  EM  algorithm  (Dempster  et  al  (1977))  is  employed  to 
obtain  maximum  likelihood  estimates  of  model  parameters  and  compute  the  maximized 
likelihoods  based  on  the  available  data.  That  paper,  however,  only  considers  the  case  in 
which  all  variables  are  continuous  and,  in  fact,  normally  distributed. 

In  this  report,  we  wish  to  consider  the  remaining  case  in  which  we  have  a  mixture 
of  continuous  and  categorical  variables  used  as  discriminants,  and  also  missing  data, 
potentially  in  both  the  training  sets  and  in  the  new  observation  to  be  classified.  Once 
again,  we  use  a  hypothesis-testing  approach  to  classification  and  bootstrap  the  test 
statistic  in  order  to  control  the  probability  of  a  particular  type  of  misclassification.  We 
present  three  algorithms  for  handling  this  situation: 

(1)  The  INDICATOR  algorithm  -  This  algorithm  begins  by  converting  each  categorical 
variable  with  j  categories  into  j  -  1  indicator  variables.  This  results  in  a  larger 
number  of  variables  (unless  all  categorical  variables  are  already  binary,  in  which 
case  the  data  set  is  unchanged).  These  indicator  variables  can  be  analyzed  using 
techniques  for  quantitative  data.  In  this  algorithm  we  make  the  (obviously 
incorrect)  assumption  that  all  variates  are  continuous  and,  in  fact,  normally 
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distributed.  We  then  perform  discriminant  analysis  using  the  transformed  data  and 
the  techniques  of  MGW. 

(2)  The  FULL  algorithm  -  Next,  we  model  the  joint  distribution  of  each  observation  in 
the  following  manner:  Suppose  each  observation  consists  of  p  categorical  variables 
and  q  continuous  variables.  The  categorical  variables  define  r  cells  of  a  contingency 
table  in  which  the  observation  could  fall,  where  r  is  the  product  of  the  number  of 
categories  possible  within  each  categorical  variable.  We  assume  that  the 
observation  will  fall  into  cell  i  (i  =  1,  ...,  r)  with  probability  Pj,  and  that  the 
conditional  distribution  of  the  continuous  part  given  that  the  discrete  part  places  the 
observation  into  cell  i  is  multivariate  normal  with  mean  Pj  and  Ej.  We  then  employ 
the  EM  algorithm  to  obtain  maximum  likelihood  estimates  of  parameters  in  this 
model  and  compute  maximized  likelihoods  of  the  available  data,  and  bootstrap  the 
ratio  of  maximized  likelihoods,  as  was  done  in  BGWMF. 

(3)  The  COMMON  algorithm  -  This  algorithm  is  essentially  the  same  as  the  FULL 
algorithm,  except  that  we  assume  a  common  covariance  matrix  across  all 
multinomial  cells.  That  is,  the  conditional  distribution  of  the  continuous  part  given 
that  the  discrete  part  places  the  observation  into  cell  i  is  assumed  multivariate 
normal  with  mean  Pj  and  E,  with  E  no  longer  depending  on  i.  This  reduces 
considerably  the  number  of  parameters  that  need  to  be  estimated  and  makes 
possible  calculation  of  the  likelihood  ratio  statistic  when  some  cells  may  be  sparsely 
represented,  or  not  represented  at  all. 

Simulation  studies  are  conducted  to  compare  and  contrast  the  performance  of  each 
of  these  procedures  with  regard  to  their  ability  to  accurately  control  the  Type  I  error  rate, 
and  with  regard  to  their  power. 
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2.  Notation  and  Overview  of  the  Generalized  Likelihood  Ratio  Test  Procedure 


Suppose  we  wish  to  classity  a  (p+q)-dimensional  random  vector  V  into  one  of  two 
populations  Ttj  or  712-  Suppose  further  that  V  can  be  partitioned  as  V  =  (X,  Y),  where  X  = 
(Xj,  X2,  Xp)  is  a  p-dimensional  vector  of  categorical  variables  and  Y  =  (Yj,  Y2, 

Y  )  is  a  q-dimensional  vector  of  continuous  variables.  Suppose  that  for  i  =  1,  p,  the 
variable  Xj  takes  on  one  of  the  rj  possible  values  1,  2, rj.  Then  the  vector  X  takes  on 
one  of  r  =  HP^jrj  possible  values.  We  let  T  denote  the  set  of  all  possible  values  of  the 
vector  X.  Finally,  suppose  that  training  samples  {Vp^},  i  =  1, ...,  Nj  from  ttj  and  {Vp^}, 
i  =  1,  ...,  N2  from  7t2,  each  having  the  same  structure  as  V,  are  available,  and  that  data 
may  be  missing  at  random  from  any  part  of  V  or  from  the  training  samples. 

The  generalized  likelihood  ratio  test  (GLRT)  procedure  for  classifying  V  into  tIj 
or  7C2  is  based  on  a  hypothesis  testing  approach.  That  is,  the  classification  of  V  is  done  by 
testing 


Hq-  V,  Vj  ,  V2  , ...  e  7Tj,  Vj  »  V2  ,  ...  ’ ^  ^2 


versus 


(1) 


Hr  e  7t,-  V  e  tt-. 


The  two  misclassification  probabilities  that  we  will  be  interested  in  are  P(2|l)  and  P(l|2), 
where  P(i[j)  denotes  the  probability  of  classifying  V  into  tij  when  in  fact  V  e  tij.  We  will 
refer  to  a  =  P(2|l)  as  the  significance  level  for  the  test  and  P(2|2)  as  the  power. 

Let  m  denote  the  number  of  elements  in  V  that  are  missing  and  let  ¥^2)  ^  (^(2)’ 
Y(2))denote  the  (p  -  m)-variate  vector  of  available  data  in  V.  Similarly,  let  mP  denote  the 
number  of  elements  missing  from  Vp  and  let  V®  ^  denote  the  (p  -  mP)-variate  vector  of 
available  data  in  (j  =  2;  i  =  1,  2,  ...,  Nj).  We  assume  that  Ttj  has  joint  density 

function  f(V|0(^))  and  that  712  has  joint  density  function  f(V|0(^)),  where  f  is  some 
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parametric  density  function  with  parameters  0^*)  and  for  populations  7i|  and  TC2, 
respectively.  Then,  under  Hq,  the  likelihood  of  V  and  the  training  samples  is  given  by 


Loi(e(')|  V,  v<‘>,  v<'>, ...  ,v^‘J)Lo2(eO)|  vf ,  ...  .v®*), 


where 


Ni 


"01 


(0(i)|  V,  ...  =  f2(v|0n))nfii(vpi0(i)). 


Lo2(0^^^I  vf  \  vf , ...  =  nf2i(vj^^|0(2)), 


N2 

(2)s  _  TTf 


(2) 

(3) 


f2(V|0('))  is  the  marginal  density  function  for  V^2)  evaluated  at  V^2)  parameters  0(^\ 
and  fjj(VP|0^))  is  the  marginal  density  function  for  evaluated  at  v|^2)  "^i^h 
parameters  Under  Hj,  the  likelihood  of  V  and  the  training  samples  is  given  by 


Li  y\^\  ...  ,v;^‘|)Li2(0(2)|  v,  v! 


(IX 


7(2)  ,7(2) 


V(2)\ 


(4) 


where 


Ni 


Lii(0(l)|  ...  ,V^J)  =  nfji(v[*^|0(l)), 


N2 


.02(0(2)]  V,  vf ,  yf\ ...  =  f2(v|0(2))  n  f2i(vf  ^10(2)), 

2  i  =1 


and  f2(V|0(2))  is  the  marginal  density  function  for  y^2)  evaluated  at  V^2)  parameters 
0(2)  emphasize  that  these  are  the  likelihood  functions  for  the  available  data  rather 
than  the  likelihood  functions  for  the  complete  data  since  f2  and  fjj  (j  =  X  2;  i  =  1,  2,  ..., 
Nj)  are  marginal  densities  for  the  available  part  of  each  observation,  rather  than  the 
likelihood  functions  for  the  complete  data. 
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The  GLRT  procedure  is  based  on  the  ratio 


(e(i)>)ii-oi(e<'>l  V.  vj'*'  '’2^’  -  ■Vn!)Lo2(8*^*I  <'■  . 


(1). 


.(2)  .,(2) 


.(2). 


LR= 


(0(1),  0(2)) 


i)“&2),Lii(e"*l  vf.  vf, ...  ,v);;)L,j(eP)l  V,  v',"',  v)"', ...  .v^ 


(2)  ^,(2) 


.(2)x 


(5) 


^.,e<V,v<",vf.....v;;;,L„,(e«|vf>vf . 

'l„(6<'>|v<".v<‘' . v^'V^ceflvy^.v® . vpy 


where  0^^  and  0^^  are  maximum  likelihood  estimates  of  0(j)  (j  2)  under  the  null  and 
alternative  hypotheses,  respectively.  That  is,  0^^^  is  the  MLE  of  0(^'^  based  on  the  sample 
{V,  ...  ,  V^^},  0Q^^  is  the  MLE  of  0(^)  based  on  the  sample  {Vj^\  ...  , 

V^^},  and  0j*^  is  the  MLE  of  0(^)  based  on  the  sample  {Vj*\  \  ...  ,  and  0j^^  is 

the  MLE  of  0(^)  based  on  the  sample  {V,  Vj^\  ... ,  V^^}. 

Equivalently,  the  test  procedure  may  be  based  on  the  statistic 


X  —  log(LR)  —  Xq  j  +  Xq2  ■  ^1 1  ■  ^12’ 


(6) 


>-0,  =  iogf2(v|9^'>)  +  Eiogf,i(v;‘)|e<,'y  (7) 

>^2=Zi<,gf2i(v®ief), 

N) 

X,l  =  Elogf,i(v‘'>|e<y),and 

>.,2  =  iogf2(V|ef>)  +  Eiogf2i(vf>ief ). 

A  key  step  in  evaluating  X  for  a  given  sample  is  the  computation  of  the  maximum 
likelihood  estimates  and  the  corresponding  maximized  log-likelihood  functions  A,qj,  Xq2, 
Xj  j,  and  A,j2  in  equation  (7).  This  is  no  trouble  when  the  data  are  complete,  as  illustrated 

by  (BGWMF).  However,  in  the  presence  of  missing  data,  the  usual  expressions  for 
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maximum  likelihood  estimates  are  no  longer  valid.  In  this  case,  maximum  likelihood 
estimates  are  obtained  via  the  EM  algorithm  (Dempster  et  al  (1977)).  The  EM  algorithm 
is  an  iterative  procedure  for  obtaining  parameter  estimates  which  maximize  the  likelihood 
function  of  the  available  data.  It  involves  two  key  steps: 

(E  -step)  -  Using  current  estimates  0^'^^  (where  k  now  denotes  the  current  iteration  step, 
rather  than  designating  tij  or  estimate  the  values  of  the  complete  data 
sufficient  statistics  by  computing  their  expectations  given  the  available  data. 
(M-step)  -  Determine  the  values  of  the  parameters  which  maximize  the  likelihood  for  the 
complete  data  based  on  the  current  estimates  of  the  complete  data  sufficient 
statistics,  thus  yielding  0^ 

Az-Jj-n 

The  EM  algorithm  iteratively  performs  E-  and  M-steps  until  the  sequence  {0'‘  } 

converges  to  an  adequate  approximation  to  the  MLE.  To  evaluate  the  test  statistic  X  of 

^(1) 

equation  (3),  we  must  implement  the  EM  algorithm  four  times.  That  is,  0^  and  A,qj  are 
based  on  the  sample  {V,  ... ,  and  Iq2  are  based  on  the  sample 

... ,  V^^},  0^^^  and  7.J  j  are  based  on  the  sample  ... ,  and  0j^^  and 

X,j2  are  based  on  the  sample  (V,  ... , 

The  decision  rule  is  described  as  follows:  small  values  of  X  provide  evidence  in 

favor  of  Hj,  hence,  V  is  classified  into  712  if  A,  <  otherwise  v  is  classified  into  Ttj.  The 

cut-off  value  X^  is  chosen  so  that  P(2|l)  =  a,  the  desired  significance  level  for  the  test. 

Since  the  null  distribution  of  X  is  not  known,  the  critical  value  is  approximated  by  the 

parametric  bootstrap  (Efron  1979).  For  some  large  integer  B,  B  bootstrap  samples  {V*, 

...  ,  are  simulated  from  a  distribution  with  density  f(V|0^*^)  and  B 

bootstrap  samples  ... ,  are  simulated  from  a  distribution  with  density 

f(V|0^^^),  where  0^^^  and  0^^^  are  MLEs  obtained  from  the  samples  {Vj^\  \  ... , 

and  ...  ,  V^^},  respectively.  (Notice  that  in  this  case,  0^^^  =  0j*^  and  0^^^  = 

1  Z  JN2 
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When  there  are  missing  values,  the  simulated  bootstrap  samples  should  also  have 

missing  values  in  a  configuration  similar  to  that  in  the  actual  data.  For  each  bootstrap 

♦  ♦  ♦ 

sample,  the  test  statistic  X  is  computed,  thus  generating  a  random  sample  {Xj,  ... ,  Xg} 
of  variates  that  have  approximately  the  same  distribution  as  X  under  Hq.  For  an  a-level 

test,  the  cut-off  value  X^  is  chosen  as  the  a-th  empirical  quantile  of  {Xj,  X.^,  ...  ,  Xg}. 

♦  * 

Finally,  V  is  classified  into  n  j  if  X  >  X^;  V  is  classified  into  7:2  if  X  <  X^. 

As  was  pointed  out  in  (MGW),  this  test  procedure  is  only  an  approximation  to  the 
true  GLRT  procedure  since  the  critical  value  is  obtained  via  bootstrapping  and  we  may 
further  relax  our  approximation  to  the  true  GLRT  procedure  by  relaxing  the  number  of 
iterations  performed  by  the  EM  algorithm.  That  is,  we  may  choose  a  stopping  criterion 
for  the  EM  algorithm  that  does  not  continue  iteration  until  convergence  has  been  obtained 
to  a  high  degree  of  accuracy.  Whatever  the  stopping  criterion,  bootstrapping  the  test 
statistic  insures  an  approximate  a-level  test.  As  in  (MGW),  it  would  appear  that  very 
little  power  is  lost  by  only  performing  a  very  few  iterations  of  the  EM  algorithm,  as 
opposed  to  carrying  out  iterations  until  MLEs  are  obtained  with  a  high  degree  of 
accuracy.  In  Section  6,  we  often  use  only  three  iterations  as  standard  practice  in  our 
simulation  studies. 

Our  implementation  of  the  GLRT  procedure  for  discriminant  analysis  is 
summarized  in  Figure  1.  Figures  2,  3,  and  4  further  describe  the  bootstrapping  module, 
the  computation  of  the  test  statistic  X,  and  the  EM  algorithm  for  obtaining  MLEs.  Each 
of  the  various  algorithms  discussed  in  this  report  share  this  common  skeletal  structure. 
The  differences  lie  in  the  type  of  model  being  assumed  for  the  data,  the  corresponding 
implementation  of  the  EM  algorithm  for  obtaining  MLEs,  and  the  precise  formulas  used 
to  evaluate  the  maximized  log-likelihood  functions. 
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The  INDICATOR  Algorithm 

In  our  first  attempt  to  implement  an  algorithm  for  discriminant  analysis  for  mixed 
categorical  and  continuous  variables  with  missing  data,  we  desired  to  use  the  methods 
presented  in  MGW  with  as  little  adaptation  as  possible.  One  way  to  do  this  would  be  to 
treat  the  categorical  variables  as  if  they  were  continuous  and  use  the  procedures  of  MGW 
without  any  alteration  at  all.  This  is  perhaps  not  such  a  bad  idea  if  categorical  variables 
have  a  large  number  of  categories,  if  these  categories  have  a  natural  ordering,  and  if  the 
distribution  of  this  variable  has  a  somewhat  normal  shape.  In  most  cases,  however,  these 
conditions  are  not  satisfied  and  the  procedure  would  be  totally  inappropriate. 

A  modification  to  the  above  approach  is  to  replace  each  categorical  variable  with 
indicator  variables  in  the  following  manner:  We  replace  each  categorical  variable  Xj 
(i  =  1, ... ,  p)  from  V  with  the  rj  - 1  indicator  variables 


Wjj  I(Xi  j)  {o  otherwise  ^  j  1,  2, ... ,  rj  - 1). 


(8) 


Hence,  the  vector  X  of  categorical  variables  gets  replaced  by  a  vector  W  of  binary 
variables  of  length  SP^jrj  -  p,  producing  the  transformed  vector  V  =  (W,  Y).  If  Xj  is 
missing  in  X,  then  each  Wjj  Q  =  I,  2,  ...  ,  rj  -  1)  is  missing  in  W.  We  transform  the 
training  samples  (j  =  1.  2;  i  =  1,  2,  ...  ,  Nj)  in  a  similar  manner  producing  (j  = 
2;i=l,2,  ...,Nj). 

Now,  having  transformed  each  observation,  we  classify  V  by  classifying  V 
according  to  the  GLRT  procedure  as  outlined  in  (MGW)  for  the  continuous-variables- 
only  case  with  missing  data  using  the  transformed  data  V  and  vP  (j  =  1,  2;  i  =  1,  2,  ..., 
Nj).  That  is,  we  proceed  as  if  V  and  VP  (j  =  1,  2;  i  =  1,  2,  ...,  Nj)  were  normally 
distributed,  ignoring  the  fact  that  many  of  the  components  are  binary.  In  simulation 
studies  (see  Section  6,  below),  we  see  that  this  method  actually  performs  about  as  well  as 
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methods  based  on  a  more  plausible  model  for  the  categorical  variables,  and  is  much 
easier  to  implement. 

4.  The  FULL  Algorithm 


Next,  we  derive  the  GLRT  procedure  using  a  more  plausible  model  for  the 
distribution  of  V.  In  this  case,  we  assume  that  the  distribution  of  X  follows  a  multinomial 
distribution  in  the  sense  that  Pr[X  =  x]  =  p^  for  each  x  e  'F,  and  the  conditional 
distribution  of  Y  given  X  =  x  is  multivariate  normal  with  mean  and  covariance  matrix 
and  Sjj.  Hence,  0  =  {p^^,  x  e  T}  and 

f(v|0)  =  p,MVN(y|p^,2,),  (9) 

where  MVN(y|p,j,  2^)  denotes  the  value  of  the  multivariate  normal  density  function  with 
parameters  p^^  and  2^  evaluated  at  y. 

The  first  step  in  deriving  the  GLRT  procedure  is  to  develop  the  EM  algorithm  for 
obtaining  MLEs  of  0  given  a  collection  of  observations  (Vj,  V2,  ...,  V^)  with  missing 
values  from  such  a  population.  The  vectors  Vj  may  be  partitioned  as  (Xj,  Yj),  where  Xj 
and  Yj  are  the  vectors  of  categorical  variables,  and  continuous  variables  respectively,  and 
further  partitioned  as  (Xj  X2i,  Yjj,  Y2j),  where  Xjj  and  Yjj  correspond  to  missing 
observations,  and  X2j  and  Y2j  correspond  to  available  observations.  (In  this  final 
partitioning,  the  dimensions  of  the  various  pieces  may  vary  with  i,  and  elements  may  be 
permuted  differently  for  each  i  according  to  the  pattern  of  missing  values  in  each 
observation.)  We  note  that  the  complete-data  sufficient  statistics  for  the  parameters  in 
this  model  are 


N,  =  i:".,i(Xi  =  x), 

S,  =  E;L,I(Xi  =  x)Yi,and  (i  e  4-) 
SS,^  =  Z;L,l(Xi  =  x)YiV^, 
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(10) 


and  that  the  MLEs  for  the  parameters  in  this  model  based  on  the  complete  data  are 

Pjj  =  and  (x  e  (11) 

A  A  A-T 

=  SSj/N^  ■  Px  Px  ■ 

The  M-step  in  this  setting  simply  amounts  to  evaluating  each  of  the  pieces  of  (1 1). 
The  main  computational  burden  lies  in  computing  the  conditional  expectations  of  the 
complete  data  sufficient  statistics  given  the  available  data  under  current  parameter 
estimates  in  each  iteration  (the  E-step). 

In  the  E-step,  we  wish  to  compute  (under  the  distribution  defined  by  0'^  ^) 


E[N,  I  {(Xji,  Yji),  i  =  1,  n}]  =  Z;L,E[I(Xi  =  x)  |  (X^,,  Yj^)],  (12) 

E[S,  I  ((Xji,  Yji),  i  -  1 ,  n)l  =  I".|E[I(Xi  =  x)Yi  |  (X^,,  Yj,)],  and  (x  s  '10 
E[SS,  I  {(Xj,,  Yji),  i  -  1,  n}]  =  i;’.,E[I(Xi  ■=  x)YiY;^  |  (Xj,.  Yj;)]. 

This  computation  is  facilitated  by  the  following  identities: 


E[I(X  =  x)h(Y)  I  (X2,  Y2)]  =  Pr[X  =  X  I  (X2,  Y2)]  •  E[h(Y)  |  X  =  x,  Y2] 

Prrx  =  xl«  Y  °  x,)p,MVN(y,  |  I,) 

Pr[X  I  (Xj,  Vj))  ^  X2)l>jMVN(y2  |  pj,  Ij) 

xeT 

E[Yi  I  X  =  X,  Y2]  =  V’(Y2  - 


(13) 

(14) 

(15) 


E[Y2|X  =  x,Y2]  =  Y2 
E[Y,Y^iX-x,Y2]  = 

Z^^  ‘I  Z^‘^^{zf  ^  +  E[Y,|  X  =  X,  Y2]E[Y,|  X  =  X,  Y2]'' 


E[Yj Y^  I  X  =  X,  Y2]  =  E[Y,  I  X  =  X,  Y2]  •  yJ 


(16) 

(17) 


(18) 
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(19) 


EiY^vl  I  X  -  X,  Y2]  =  Y2Y^ 

Here,  sf’l  and  are  appropriate  partitions  of  and  E^ 

corresponding  to  the  missing  and  available  parts  of  Y.  Equations  (15)  -  (19)  are  used  to 
estimate  the  missing  parts  of  E[Y  |  X  -  x,  Y2]  and  E[YY^  |  X  =  x,  Y2].  Computation  of 
the  expectations  in  (12)  is  then  carried  out  as  follows:  For  each  observation  in  the  data 
set,  compute  E[I(Xi  =  x)  |  (X2i,  Y2i)],  E[I(Xi  =  x)Yi  |  (X2i,  Y2i)],  and  E[I(Xj  =  x)YiYj^  1 
(X2j,  Y2i)]  for  each  x  €  T  via  equations  (13)  -  (19).  Accumulate  these  over  all 
observations  to  obtain  (12). 

In  the  case  of  continuous  variables  only,  (MOW)  used  estimates  of  parameters 
based  on  substituting  means  for  missing  observations  as  initial  estimates  in  the  iterative 
process.  This  becomes  more  complicated  in  the  presence  of  categorical  variables.  To 
simplify  initialization,  we  use  “blind  initialization”:  we  initialize  each  p^  with  1/r,  each 
u  with  0,  and  each  E„  with  I.  Experience  so  far  indicates  that  the  first  iteration  of  the 
EM  algorithm  substantially  alters  the  parameter  estimates  to  something  comparable  to 
“mean  substitution,”  if  that  means  anything  in  this  context.  In  any  case,  this  initialization 
procedure  has  worked  adequately  so  far  in  simulation  studies. 

Having  evaluated  the  MLEs  using  the  EM  algorithm,  we  need  a  method  for 
evaluating  the  maximized  log-likelihood  functions  in  equation  (7).  The  likelihood 
function  for  the  available  data  is  the  product  of  the  likelihoods  of  the  available  parts  of 
each  observation.  The  likelihood  of  the  available  part  of  a  single  observation  is  the 
marginal  density  for  (X2,  Y2).  This  may  be  obtained  from  the  density  for  (X,  Y)  by 
integrating  out  Xj  and  Y  j .  This  gives 

~  ^  ^^*2  ~  *2)  Px  MVN(y2  I  E^^).  (20) 
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The  maximized  log-likelihood  of  the  available  data  in  a  sample  is  obtained  by 
accumulating  the  values  of  the  log  of  (20)  over  all  observations  in  the  sample.  Thus,  we 
may  evaluate  each  of  the  pieces  X,qj,  Xq2,  and  ?ij2  in  equation  (7),  from  which  we 
may  evaluate  the  test  statistic  X  ^  log(LR)  given  in  (6). 

As  will  be  seen  in  the  simulation  results  of  Section  6,  the  FULL  Algorithm  has 
one  major  flaw  that  must  be  addressed.  That  is,  it  can  only  be  used  in  cases  in  which 
there  is  adequate  representation  in  each  cell  to  obtain  a  full-rank  estimate  of  for  each  x 
e  'P  in  both  populations.  If  r  is  large,  i.e.,  if  there  are  a  large  number  of  categorical 
variables,  or  a  large  number  of  categories  within  some  categorical  variables,  or  both,  then 
the  training  samples  may  need  to  be  extremely  large  so  that  all  parameters  can  be 
estimated  accurately.  In  practice,  such  large  samples  may  not  be  available,  and  it 
becomes  necessary  to  impose  further  constraints  on  the  parameters  of  the  model  so  that 
the  number  of  parameters  required  is  reduced.  This  leads  us  into  our  discussion  of  the 
next  algorithm. 

5.  The  COMMON  Algorithm 

This  last  algorithm  is  very  similar  to  the  FULL  Algorithm  except  that  in  our 
model  for  the  data,  we  assume  that  the  conditional  covariance  matrix  for  the  continuous 
part  given  the  discrete  part  is  common  for  all  x  e  T.  That  is,  the  conditional  distribution 
of  Y  given  X  =  x  is  multivariate  normal  with  mean  and  covariance  matrix  and  E  not 
depending  on  x.  Hence,  0  =  (p^^,  p^,  E;  x  e  'F}.  This  reduces  the  number  of  parameters 
that  need  to  be  estimated  considerably,  and  makes  parameter  estimation  possible  when 
some  cells  are  sparse,  or  not  represented  at  all.  We  allow  the  possibility  of  different 
parameters  for  each  of  the  two  populations,  but  within  each  population,  E  is  common 
across  all  multinomial  cells.  This  model  gives  precisely  the  general  location  model  of 
Olkin  and  Tate  (1961).  The  EM  algorithm  for  this  model  is  developed  by  Little  and 
Schluchter  (1985),  and  they  point  out  that  this  can  be  used  to  implement  the  GLRT 
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procedure  proposed  by  Krzanowski  (1982).  What  follows  is  precisely  this  procedure, 
with  the  added  feature  that  we  bootstrap  the  distribution  of  the  test  statistic  in  order  to 
choose  critical  values  to  control  the  P(2|l)  error  rate.  Although  Little  and  Schluchter 
(1985)  describe  the  EM  algorithm  for  this  model  in  considerable  detail,  we  present  a 
description  of  the  algorithm  here  that  is  consistent  with  the  notation  of  Section  4. 

First,  we  observe  that  the  complete-data  sufficient  statistics  for  the  parameters  in 
this  model  are 


N,  =  E;LiI(Xi  =  x), 

I(Xi  =  x)Yi,  and  (x  €  T)  (21) 

SS  =  YjY^, 

and  that  the  MLEs  for  the  parameters  in  this  model  based  on  the  complete  data  are 

Px  =  N,,/n, 

^^  =  S^/N^,and  (xen  (22) 

A  A  A 

E  —  EjjgvpPjjEj, 

A 

where  is  given  by  equations  (10)  and  (11).  In  other  words,  the  MLE  of  E  in  this  case 
is  precisely  a  weighted  average  of  MLEs  of  E^  for  each  x  based  on  the  FULL  model  with 
weights  p^.  Hence,  we  may  perform  the  M-step  in  this  algorithm  with  exactly  the  same 

A 

formulas  as  the  M-step  in  the  FULL  Algorithm,  except  that  after  each  E^  is  computed,  we 
average  these  according  to  (22)  to  obtain  the  updated  estimate  of  the  common  E.  The 
E-step  for  this  algorithm  is  also  identical  to  the  E-step  in  the  FULL  Algorithm,  except 

A  A 

that  throughout  formulas  (13)  -  (19),  each  E^^  is  replaced  by  the  common  E.  The 
evaluation  of  the  maximized  log-likelihood  functions  in  (7)  is  also  performed  using  (20), 

A 

as  in  the  FULL  algorithm,  with  again,  the  only  difference  being  that  each  E^^  is  replaced 

A 

by  the  common  E. 
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6.  Simulation  Results 


We  have  performed  simulations  of  each  of  the  three  algorithms  (INDICATOR, 
FULL,  and  COMMON)  based  on  several  different  parameter  configurations  in  order  to 
determine  how  well  the  algorithm  controls  the  Type  I  misclassification  probability  as 
desired,  and  to  assess  the  power  P(2j2)  of  each  algorithm.  We  also  keep  track  of  how 
many  times  the  algorithm  fails  to  classify  the  observation  at  all.  These  failures  occur 
when  for  some  reason  the  simulated  data  fails  to  yield  full-rank  estimates  of  all  required 
covariance  matrix  parameters.  This  results  in  an  undefined  test  statistic  X.  This  happens 
most  frequently  in  the  FULL  algorithm,  and  is  caused  by  a  very  few  number  of 
observations  falling  into  one  or  more  of  the  multinomial  cells.  It  happens  occasionally  in 
the  INDICATOR  algorithm  when  at  least  one  possible  value  of  a  categorical  variable  is 
not  represented.  Failures  may  occur  when  the  test  statistic  is  undefined  for  the  sample 
which  we  are  trying  to  classify,  and  also  when  the  statistic  is  undefined  for  attempted 
bootstrap  samples.  We  see  in  our  simulations  that  the  COMMON  algorithm  is  least 
susceptible  to  these  types  of  failures. 

Our  first  simulation  involved  the  same  parameter  configurations  used  in  BGWMF 

(Case  3:  Mixture  of  Categorical  and  Continuous  Variables).  That  is,  we  consider  the  case 

in  which  the  categorical  part  is  a  single  Bernoulli  variable  and  the  continuous  part  is  a 

single  normal  random  variable  independent  of  the  categorical  variable.  For  population 

Ttj,  the  Bernoulli  parameter  is  pj  =  0.1.  The  mean  and  variance  of  the  continuous 

2 

variable  are  P|  =  0  and  CTj  =  0.5,  respectively.  For  population  712,  we  use  P2  =  0.9,  0.7, 

2  2 

and  0.5,  =  LO,  and  ^2  ^  ^  takes  on  values  0,  1,  2,  and  3.  The 

observed  significance  level  P(2|l)  is  the  proportion  of  times  out  of  500  simulated  trials  in 

which  the  variable  V  is  classified  into  712  when,  in  fact,  it  was  simulated  from  71  j.  The 

estimated  power  P(2|2)  is  the  proportion  of  times  out  of  500  simulated  trials  in  which  the 

variable  V  is  classified  into  7C2  when,  in  fact,  it  was  simulated  from  712.  In  order  to 

achieve  an  approximate  significance  level  of  a  =  0.05,  the  variable  V  was  classified  into 
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♦  * 
7^2  if  the  test  statistic  X  is  less  than  or  equal  to  ,  the  0.05-th  empirical  quantile  of  {X^, 

^2’  ■■■  ’ 

For  Nj  =  N2  =  50  and  B  =  99,  the  power  estimates  are  plotted  in  Figure  5,  based 
on  simulations  with  no  missing  data.  We  see  that  the  FULL  and  COMMON  algorithms 
agree  very  well  with  the  power  curves  plotted  in  BGWMF  (Figure  2).  In  fact,  with  no 
missing  data,  the  COMMON  algorithm  is  essentially  equivalent  to  the  method  of 
BGWMF,  so  these  simulation  results  should  agree  very  well,  as  they  do.  The 
INDICATOR  algorithm  does  not  agree  well  with  the  FULL  and  COMMON  algorithms. 
For  this  reason,  the  points  corresponding  to  the  INDICATOR  algorithm  are  not  connected 
with  lines,  since  this  would  clutter  the  plot.  It  would  seem  that  the  INDICATOR 
algorithm  has  higher  power  in  general  than  the  other  two.  This  is  surprising  since  this 
algorithm  does  not  model  well  the  true  distribution  of  the  binary  variable.  A  closer 
examination  of  the  simulation  results  shows  that  this  is,  in  fact,  misleading,  since  the 
INDICATOR  tends  to  yield  a  significance  level  nearly  twice  the  desired  0.05  level.  This 
can  be  seen  in  Figure  6,  which  shows  the  power  estimate  plotted  versus  the  observed 
significance  level.  Each  plot  in  Figure  6  corresponds  to  a  specific  value  of  A.  We  can 
also  see  that  the  COMMON  algorithm  most  accurately  achieves  the  desired  a  =  0.05 
significance  level. 

In  Figures  7  and  8,  we  show  corresponding  plots  based  on  data  with  missing 
values.  In  these  simulations,  each  variable  in  each  observation  was  deleted  independently 
with  probability  0.1,  so  that  roughly  10%  of  the  data  is  missing.  We  see  an  overall 
decrease  in  the  power  of  all  three  algorithms  compared  to  the  full-data  case,  but  this  is  to 
be  expected  since  the  test  is  based  on  less  available  data.  Otherwise,  the  results  of  the 
missing-data  case  are  comparable  to  the  results  of  the  full-data  case. 

We  have  tabulated  the  results  of  this  simulation  in  Table  1 .  The  ERROR  column 

shows  the  percentage  of  times  out  of  the  500  simulations  that  the  algorithm  failed  to 

classify  V  due  to  singular  parameter  estimates.  We  see  that  the  FULL  algorithm  is  most 
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PROGRAM  CODE:  I  -  INDICATOR  F  -  FULL  C  -  COMMON 

Table  1 .  Listing  of  results  from  first  simulation. 


susceptible  to  failures  of  this  sort,  failing  as  much  as  6  to  7%  of  the  time  when  P2  =  0.9 
and  no  data  is  missing.  The  INDICATOR  algorithm  failed  somewhat  less  frequently,  and 
the  COMMON  algorithm  never  failed  in  this  study. 

In  our  next  simulation  study,  we  consider  the  case  in  which  we  have  two 
categorical  variables,  each  with  two  categories,  and  two  continuous  variables.  For 
population  Ttj,  each  possible  combination  of  the  categorical  part  (X  =  (1,1),  (1,2),  (2,1) 
and  (2,2))  occurs  with  probability  1/4.  The  conditional  distribution  of  the  continuous  part 
is  MVN(0,  Sj),  where 


1  0.5 

0.5  1 


(23) 


within  each  multinomial  cell  (i.e.,  conditional  on  each  possible  value  of  the  discrete  part). 
For  population  712,  the  conditional  covariance  matrix  for  the  continuous  part  is  I.2,  where 
S2  is  given  by 


1  -0.5 

-0.5  1 


(24) 


We  use  three  different  probability  distributions  for  the  discrete  part,  and  four  different 
configurations  of  mean  vectors  for  the  conditional  distributions  of  the  continuous  part 
given  each  possible  discrete  part.  In  the  plots  and  tables  which  follow,  the  three 
probability  distributions  are  coded  with  the  variable  PCODE,  which  takes  on  values  1,  2, 
and  3.  The  four  mean  vector  configurations  are  coded  with  the  variable  MCODE,  which 
takes  on  values  1,  2,  3,  and  4.  The  parameter  configurations  defined  by  these  codes  are 
shown  in  Table  2. 
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Table  2.  Definitions  for  parameter  codes  used  in  out  second  simulation  study. 

PCODE  =  1  corresponds  to  a  uniform  distribution  across  all  multinomial  cells.  PCODE 
=  2  and  PCODE  =  3  correspond  to  distributions  increasingly  favoring  cell  (1,1). 
MCODE  =  1  corresponds  to  a  mean  configuration  identical  to  that  for  population  tij. 
MCODE  =  2  and  MCODE  =  3  correspond  to  changes  in  mean  for  certain  cells,  and 
MCODE  =  4  corresponds  to  the  sum  of  these  two  changes.  For  PCODE  =  1  and 
MCODE  =  1,  population  712  is  identical  to  population  tij  except  for  the  correlation 
between  the  two  continuous  variables. 

As  in  our  first  study,  we  take  Nj  =  N2  =  50,  B  =  99,  a  =  0.05,  and  base  our 
observed  significance  level  and  power  estimates  on  500  replications  of  the  procedure  in 
each  case.  Figure  9  shows  the  power  estimates  plotted  versus  the  mean  configuration 
when  no  data  is  missing.  Figure  10  shows  plots  of  the  power  estimate  versus  the 
observed  significance  level  for  each  mean  configuration.  Figures  11  and  12  are 
corresponding  plots  for  approximately  10%  missing  data,  with  data  deleted  at  random  in 
the  same  manner  as  our  previous  study.  Table  3  shows  a  listing  of  these  results, 
including  the  percentages  of  failures  due  to  singular  parameter  estimates. 

In  Figure  9,  we  see  the  power  increases  in  general  as  the  separation  between  in 
means  increases  {i.e.,  as  MCODE  changes  from  1  to  4).  MCODE  =  2  and  MCODE  =  3 
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Table  3.  Listing  of  results  from  second  simulation  study. 


actually  correspond  to  the  same  degree  of  difference  in  means,  so  the  power  for  these  are 
not  expected  to  be  too  different.  In  fact,  the  power  estimates  for  MCODE  =  2  and 
MCODE  =  3  are  very  similar  when  PCODE  =  1 .  However,  they  are  not  very  similar 
when  PCODE  =  2  or  3.  In  this  case,  power  is  lower  for  MCODE  =  3  than  for  MCODE  = 
2.  This  results  since  for  MCODE  =  3,  the  means  differ  in  sparse  cells,  whereas  for 
MCODE  =  2,  the  means  differ  most  in  the  most  common  cell  (corresponding  to  X  = 
(1,1)),  making  it  more  easy  to  differentiate  between  the  two  populations.  We  see  similar 
patterns  in  Figure  11,  and  can  also  see  a  general  decrease  in  power  due  to  missing  values. 

These  plots  seem  to  indicate  that  the  INDICATOR  and  COMMON  algorithms 
have  very  similar  power,  these  being  generally  better  than  the  FULL  algorithm.  As  in  our 
first  study,  we  notice  in  Figures  10  and  12  that  the  INDICATOR  algorithm  has  a 
tendency  to  yield  a  higher  significance  level  than  desired,  especially  when  PCODE  =  3 
(i.e.,  when  some  cells  are  very  sparse). 

We  see  from  Table  3  that  the  INDICATOR  algorithm  fails  occasionally  due  to 
singular  covariance  matrix,  especially  when  PCODE  =  3.  The  FULL  algorithm  does 
much  worse  when  cells  are  sparse.  The  FULL  algorithm  fails  about  two-thirds  of  the 
time  when  PCODE  =  3  and  data  is  missing!  When  some  cells  occur  with  very  low 
probability,  it  is  necessary  to  have  very  large  samples  so  that  each  cell  is  represented 
enough  to  obtain  a  full-rank  estimate  of  the  covariance  matrix  within  that  cell.  Samples 
of  size  50,  cells  are  not  adequately  represented  about  two-thirds  of  the  time.  Once  again, 
we  see  that  the  COMMON  algorithm  is  least  susceptible  to  failures  due  to  singular 
parameter  estimates. 

Readers  may  wonder  why  the  algorithm  doesn't  fail  every  time  for  PCODE  =  3 

since  cell  (1,1)  is  never  represented.  If  a  cell  probability  is  estimated  to  be  zero,  the 

covariance  matrix  estimate  for  that  cell  is  never  used  in  the  computation  of  X,  and  can, 

therefore,  be  disregarded.  It  is  not  cell  (1,1)  that  is  the  problem  here,  rather  it  is  cells 

(0,1)  and  (1,0).  Readers  may  also  find  it  strange  that  for  PCODE  =  2,  there  are  fewer 
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failures  in  the  FULL  algorithm  when  data  is  missing  than  when  all  data  is  available.  This 
may  be  explained  intuitively  as  follows:  When  data  is  missing  in  the  discrete  part,  there 
is  some  possibility  that  the  observation  falls  into  any  of  a  number  of  cells.  This 
observation  contributes  to  the  parameter  estimates  for  all  cells  to  which  the  observation 
might  truly  belong,  resulting  in  fewer  rank  problems  in  sparse  cells. 

7.  Concluding  Remarks 

In  this  report,  we  have  extended  the  results  of  BGWMF  and  MGW  to  perform 
discriminant  analysis  with  categorical  and  continuous  variables  when  data  is  missing. 
We  presented  three  algorithms  for  doing  so.  In  simulation  studies,  we  have  observed  that 
the  INDICATOR  algorithm  has  a  tendency  to  yield  a  higher  Type  I  error  rate  than 
desired.  The  FULL  algorithm  often  fails  due  to  singular  parameter  estimates  when  some 
value  of  the  discrete  part  is  sparsely  represented.  The  COMMON  algorithm  seems  to 
avoid  these  problems,  and  is,  therefore,  the  preferred  algorithm,  especially  when  samples 
are  small  and  the  assumption  of  a  common  covariance  matrix  across  all  multinomial  cells 
is  reasonable.  The  code  has  now  been  transferred  to  MRC  and  Dr.  Mark  Fisk  is  applying 
these  techniques  to  some  existing  seismic  data. 
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Figure  1 .  Flowchart  for  the  GLRT  procedure  for  discriminant  analysis 


Figure  2.  Flowchart  for  the  bootstrapping  portion  of  the  GLRT  procedure. 
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Figure  3.  Flowchart  for  computing  the  test  statistic  X  in  the  GLRT  procedure. 


Figure  4.  Flowchart  for  the  EM  algorithm. 


24 


DELTA 


Figure  5.  Power  estimates  when  no  data  is  missing  for  each  of  the  three  algorithms,  wdth 
data  modeled  as  a  Bernoulli  random  variable  and  an  independent  normal  random  variable. 

2 

Parameters  for  population  Ttj  are  pj  =  0.1,  pj  =  0,  and  Oj  =  0.5.  Power  estimates  are 

2 

based  on  the  following  configurations  for  population  ^2'.  P2  =  0.9,  0.7,  and  0.5,  ^2  =  1.0, 
2 

and  \i2  =  0.5  +  Aa2,  where  A  takes  on  values  0,  1,  2,  and  3.  For  each  value  of  A,  the 
symbols  I,  F,  and  L  are  plotted  at  the  corresponding  power. 
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Figure  7.  Power  estimates  when  approximately  10%  of  the  data  is  missing  for  each  of  the 
three  algorithms,  with  data  modeled  as  a  Bernoulli  random  variable  and  an  independent 


2 

normal  random  variable.  Parameters  for  population  ttj  are  pj  =  0.1,  pj  =  0,  and  aj  =  0.5. 


Power  estimates  are  based  on  the  following  configurations  for  population  ny.  P?  ~ 
2  2 

0.7,  and  0.5,  02  =  1.0,  and  1X2  =  0.5  +  A02,  where  A  takes  on  values  0,  1,  2,  and  3. 
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Figure  9.  Power  estimates  when  no  data  is  missing  for  each  of  the  three  algorithms  when 
data  have  two  binary  and  two  continuous  variates,  possibly  dependent.  For  population  tcj, 
each  possible  combination  of  the  binary  part  occurs  with  probability  1/4.  TTie  conditional 
distribution  of  the  continuous  part  is  MVN(0,  Sj),  where  Zj  is  a  2x2  matrix  with  diagonal 
elements  of  one  and  off-diagonal  elements  of  0.5,  within  each  multinomial  cell.  For 
population  7t2,  the  conditional  covariance  matrix  for  the  continuous  part  is  Z2>  where  Z2 
has  ones  on  the  diagonal  and  off-diagonal  elements  of  -0.5.  Several  distributions  for  the 
discrete  part,  and  several  choices  of  mean  vectors  are  used,  as  defined  in  Table  2. 
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PWR 


Figure  10.  Plots  of  estimated  power  versus  observed  significance  for  each  mean  configuration  in  our  second  study  when  no  data  is 
missing.  Symbol  used  is  I  for  INDICATOR  algorithm,  F  for  FULL,  and  C  for  COMMON.  The  solid  line  corresponds  to  PCODE  =  1, 
the  dashed  line  to  PCODE  =  2,  and  the  dotted  line  to  PCODE  =  3. 
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Figure  11.  Power  estimates  when  approximately  10%  of  the  data  is  missing  for  each  of 
the  three  algorithms  when  data  have  two  binaiy  and  two  continuous  variates,  possibly 
dependent.  For  population  Ttj,  each  possible  combination  of  the  binary  part  occurs  vvdth 
probability  1/4.  The  conditional  distribution  of  the  continuous  part  is  MVN(0,  Zj),  where 
Zj  is  a  2x2  matrix  with  diagonal  elements  of  one  and  off-diagonal  elements  of  0.5,  within 
each  multinomial  cell.  For  population  712,  the  conditional  covariance  matrix  for  the 
continuous  part  is  Z2,  where  Z2  has  ones  on  the  diagonal  and  off-diagonal  elements  of 
-0.5.  Several  distributions  for  the  discrete  part,  and  several  choices  of  mean  vectors  are 
used,  as  defined  in  Table  2. 
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(c)  MC0DE  =  3  (d)  MC0DE  =  4 

Figure  12.  Plots  of  estimated  power  versus  observed  significance  for  each  mean  configuration  in  our  second  study  when  10%  of  the 
data  is  missing.  Symbol  used  is  I  for  INDICATOR  algorithm,  F  for  FULL,  and  C  for  COMMON.  The  solid  line  corresponds  to 
PCODE  =  1,  the  dashed  line  to  PCODE  =  2,  and  the  dotted  line  to  PCODE  =  3. 
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